Foundation models in molecular biology

Determining correlations between molecules at various levels is an important topic in molecular biology. Large language models have demonstrated a remarkable ability to capture correlations from large amounts of data in natural language processing as well as image generation, and the correlations they capture can be applied to a wide range of specific tasks; hence large language models are also referred to as foundation models. The massive amount of data available in molecular biology provides an excellent basis for developing foundation models, and their recent emergence has substantially advanced the field. We summarize the foundation models developed on RNA sequence data, DNA sequence data, protein sequence data, single-cell transcriptome data, and spatial transcriptome data, respectively, and further discuss research directions for the development of foundation models in molecular biology.


INTRODUCTION
Interactions between biomolecules (such as metals, proteins, lipids, and nucleic acids) are involved in numerous life processes at different spatial scales (Fig. 1A) and are essential for the maintenance of normal life activities (Limo et al. 2018; Nooren and Thornton 2003; Tiwari and Chakrabarty 2021; Jankowsky and Harris 2015). For example, interactions between residues determine the folding path of a protein and the structure it folds into, and misfolding can lead to abnormal protein function (Dobson 1999; Hartl 2017); interactions between proteins are essential for intercellular signaling and intracellular catalysis (Henderson and Pockley 2010; Zheng et al. 2023). Decoding the interaction networks of biomolecules is a central challenge in molecular biology. A comprehensive understanding of these networks will not only dramatically advance the understanding of life processes and the treatment of diseases, but will also enable the construction of numerical models of biological systems that are capable of precise biomolecular experimentation.
There are various types of interactions between biomolecules, such as protein-protein interactions and RNA-small molecule interactions, and numerous approaches have been used to characterize them (Gao et al. 2023; Lenz et al. 2021; Mann et al. 2017; Sledzieski et al. 2021; Umu and Gardner 2017). Biological experiments are commonly used for this purpose (Bai et al. 2015; Nguyen et al. 2016; Rual et al. 2005), and the combination of high-throughput and low-throughput experiments has generated a large amount of valuable data. For example, 3.7 million pairs of experimentally discovered RNA-RNA interactions are stored in the starBase database (Li et al. 2014), and the BioGRID database (Oughtred et al. 2021) stores 2.7 million pairs of experimentally discovered protein-protein interactions. However, it is estimated that experimentally discovered interactions still represent only a small fraction of all possible interactions (Lu et al. 2020; Ramanathan et al. 2019). Computational approaches are important complements to experimental approaches in determining whether interactions exist between biomolecules; they tend to be much faster than experiments, although their accuracy may be limited (Cirillo et al. 2012; McDowall et al. 2009; Puton et al. 2012; Rao et al. 2014). Deep learning is a significant breakthrough among computational approaches: it is good at learning interaction patterns from known interactions between biomolecules and then applying the learned knowledge to explore undiscovered interactions (Gao et al. 2023; Li et al. 2022b; Singh et al. 2022). The performance of deep learning approaches is much better than that of traditional approaches, but it remains limited by the lack of interaction data.
In contrast to the seriously scarce task-specific data, the huge amount of unlabeled data is another yet-to-be-explored treasure within molecular biology (including protein sequences, DNA sequences, RNA sequences, single-cell transcriptome data, etc., see Table 1). For example, there are only about 500,000 experimental protein structures (as determined by residue-residue interactions) in the Protein Data Bank (Goodsell et al. 2020), whereas the BFD protein sequence database already contains 2.5 billion protein sequences (Jumper et al. 2021). These unlabeled data are "snapshots" of the interactions between biomolecules: protein sequences reveal which residues, arranged in which order, fold into a stable protein structure, while single-cell transcriptome data imply the regulatory relationships between genes. How to distill the correlations between biomolecules from these unlabeled data is another important question, and this is an area in which language models can specialize. Language modeling has been very widely used in molecular biology after its great success in natural language processing and has led to the research paradigm of "pre-training + fine-tuning" (Bepler and Berger 2021; Devlin et al. 2019; Dodge et al. 2020; Vaswani et al. 2017; Wang et al. 2023e). In this review, we first briefly describe the architectures of common language models, then report the performance and application scenarios of language models developed based on RNA sequence data, protein sequence and structure data, and single-cell transcriptome data, and finally discuss the next steps in the development of language models in molecular biology.

LANGUAGE MODELS
Understanding words or phrases in their context is a critical challenge in natural language processing, and it has been greatly facilitated by the introduction of deep learning, especially large language models. Large language models usually adopt Long Short-Term Memory (LSTM) or Transformer (Vaswani et al. 2017) as the backbone network, which is trained with self-supervised learning on a large amount of unlabeled text. The central concept of self-supervised learning is to use the data itself to generate labels, and two approaches are in common use today: one randomly masks a portion of the text and uses the unmasked portion to predict the masked content (Devlin et al. 2019; He et al. 2020; Joshi et al. 2018), and the other predicts the next word or phrase from the preceding text (Brown et al. 2020; Radford et al. 2018, 2019). If a model can predict the masked content or the next word from the existing text, it has captured the correlations between words and can, to some extent, understand the meaning of a word in its context. BERT (Devlin et al. 2019), ESM-1b (Rives et al. 2021) and other works (Brown et al. 2020; Cui et al. 2020; Dong et al. 2019; Radford et al. 2018) have shown that large language models can indeed predict the masked content in text, and Peters et al. (2018) found that the word embeddings extracted from large language models encode the context of the corresponding word. Language models trained to recover the content of masked regions are superior in text comprehension, whereas language models trained to predict the next word excel in text generation (Ethayarajh 2019; Klein and Nabi 2019). Although the performance of language models differs across network architectures, the BERT (Devlin et al. 2019) and GPT (Radford et al. 2018, 2019) architectures are currently the most widely used because of their excellent performance in understanding text and generating new text, respectively. We first introduce the Transformer, and then describe the architecture and training approaches of BERT and GPT.
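The two self-supervised labeling schemes described above can be illustrated with a minimal sketch. It assumes a toy sequence of single-character tokens; the `[MASK]` symbol, the 15% mask rate and both function names are illustrative choices, not taken from any particular model.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, chosen for illustration


def make_masked_example(tokens, mask_rate=0.15, seed=0):
    """Masked objective: hide some tokens and keep the originals as labels."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]  # the label comes from the data itself
        masked[pos] = MASK
    return masked, labels


def make_next_token_examples(tokens):
    """Next-word objective: every prefix is paired with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = list("MKTAYIAKQR")  # a made-up sequence used only as toy input
masked, labels = make_masked_example(tokens)
pairs = make_next_token_examples(tokens)
```

In both cases no external annotation is needed: the training targets are simply parts of the input that were hidden from the model.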

Transformer architecture
Transformer (Vaswani et al. 2017) uses an encoder-decoder architecture and achieves excellent performance on machine translation tasks. The encoder converts the input sequence into a continuous representation, and the decoder uses the output of the encoder as a condition to sequentially predict the words in the translated sentence. Each layer in the encoder and decoder consists of a multi-head attention module and a feed-forward module. The "Scaled Dot-Product Attention" in the multi-head attention module ensures that the encoder considers the entire input when processing each element, and is computed as:

$$Q = XW^{Q}, \quad K = XW^{K}, \quad V = XW^{V}$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $X$ denotes the input; $Q$, $K$, $V$ denote the query, key and value transformed from the input; $d_k$ is the dimension of the keys; and $W^{Q}$, $W^{K}$, $W^{V}$ are the parameters to be learned.
Multi-head attention builds on "Scaled Dot-Product Attention" to increase the representation capability of the model by mapping the input sequence to different attention spaces:

$$\mathrm{head}_i = \mathrm{Attention}(XW_i^{Q}, XW_i^{K}, XW_i^{V})$$

$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}$$

where $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ and $W^{O}$ are the parameters to be learned.
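As a concrete illustration of scaled dot-product attention and multi-head attention, the following NumPy sketch follows the formulation of Vaswani et al. (2017). The dimensions, the random weights and the function names are assumptions made for the example, and layer normalization, dropout and biases are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Return the attended values and the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights


def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v have shape (h, d_model, d_k); W_o has shape (h * d_k, d_model)."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        head, _ = scaled_dot_product_attention(X @ Wq_i, X @ Wk_i, X @ Wv_i)
        heads.append(head)
    # Concatenate the heads along the feature axis, then project back to d_model.
    return np.concatenate(heads, axis=-1) @ W_o


rng = np.random.default_rng(0)
L, d_model, h, d_k = 5, 8, 2, 4  # toy sizes chosen for illustration
X = rng.normal(size=(L, d_model))
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model))

out = multi_head_attention(X, W_q, W_k, W_v, W_o)
_, attn = scaled_dot_product_attention(X @ W_q[0], X @ W_k[0], X @ W_v[0])
```

Each row of `attn` is a probability distribution over all input positions, which is what lets every element take the entire input into account.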

BERT architecture
BERT (Devlin et al. 2019) is a multi-layer bidirectional language model obtained by stacking Transformer (Vaswani et al. 2017) encoders. As shown in Fig. 1C, given a sequence of L words $X = (x_1, x_2, \ldots, x_L)$, the training objective of BERT is to recover the content of the masked portion of the sequence. Taking the case where the i-th word is masked as an example, BERT is trained to maximize the following likelihood:

$$P(\hat{x}_i = x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_L)$$

where $\hat{x}_i$ denotes the word predicted by BERT after masking the i-th word of $X$.
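In practice several positions are usually masked at once. As a sketch of how the single-position case generalizes, write $X = (x_1, \ldots, x_L)$ for the sequence, $M$ for the set of masked positions, $X_{\setminus M}$ for the sequence with those positions hidden, and $\hat{x}_i$ for the model's prediction at position $i$. Under the common assumption that the masked positions are predicted independently given the unmasked ones, the loss to minimize can be written as:

$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in M} \log P(\hat{x}_i = x_i \mid X_{\setminus M})$$

With $M = \{i\}$ this reduces to the negative log of the single-position likelihood. The symbols $M$ and $X_{\setminus M}$ are notation introduced here for illustration.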

GPT architecture
GPT (Radford et al. 2018, 2019) is a multi-layer unidirectional language model obtained by stacking Transformer decoders. Similar to the encoder introduced in the previous section, each decoder consists of a multi-head attention module and a feed-forward module. The training objective of GPT is to predict the next word from the preceding text (see Fig. 1D), and the multi-head attention layer ensures that GPT can consider all of the preceding text when making predictions. Given a sequence of L words $X = (x_1, x_2, \ldots, x_L)$, GPT is trained to maximize the following likelihood:

$$\prod_{i=1}^{L-1} P(\hat{x}_{i+1} = x_{i+1} \mid x_1, \ldots, x_i)$$

where $\hat{x}_{i+1}$ denotes the (i + 1)-th word predicted by GPT after considering the previous i words.
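The unidirectional behaviour of GPT-style models and the next-word likelihood can be sketched as follows. The lower-triangular mask is the standard way to restrict attention to preceding positions; the logits here are placeholders standing in for the output of a real decoder, and the function names are illustrative.

```python
import numpy as np


def causal_mask(L):
    """Boolean (L, L) matrix; entry [i, j] is True when position i may attend to j."""
    return np.tril(np.ones((L, L), dtype=bool))


def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def autoregressive_log_likelihood(logits, token_ids):
    """Sum of log P(x_{i+1} | x_1..x_i) over the sequence.

    logits has shape (L - 1, vocab_size): row i scores the token that
    follows the first i + 1 tokens. token_ids is the length-L sequence.
    """
    log_probs = log_softmax(logits)
    targets = np.asarray(token_ids[1:])
    return float(log_probs[np.arange(len(targets)), targets].sum())


mask = causal_mask(4)
token_ids = [2, 0, 1, 3]            # toy token indices
uniform_logits = np.zeros((3, 4))   # placeholder for decoder outputs
ll = autoregressive_log_likelihood(uniform_logits, token_ids)
```

Maximizing the product of conditional probabilities is equivalent to maximizing this summed log-likelihood, which is the numerically preferable form.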

LANGUAGE MODELS FOR PROTEINS
Proteins are biological macromolecules composed of hundreds or thousands of amino acids (amino acids within proteins are often referred to as residues due to dehydration condensation), and the interactions between residues drive the folding of proteins into specific structures, which in turn perform specific functions (Kim et al. 2014). Given the importance of protein structure, countless approaches have been proposed over the past decades to advance protein structure prediction (Ding et al. 2018; Golkov et al. 2016; He et al. 2017; Jones et al. 2015; Ju et al. 2021; Wang et al. 2017; Xu 2019; Yang et al. 2020). Among them, approaches that use mutual information, direct coupling analysis, and other tools to derive residue interactions from multiple sequence alignments, and then predict protein structure from the residue interactions using tools such as PyRosetta (Chaudhury et al. 2010) and CNS (Brunger 2007), have achieved remarkable success and have become the dominant paradigm for protein structure prediction (Senior et al. 2020; Wang et al. 2017; Yang et al. 2020). Methods that predict residue interactions with deep learning, such as residual networks (He et al. 2016), are the latest advances in this paradigm, but they are still far from solving the problem of protein structure prediction, whereas the introduction of language models has brought the problem close to being solved (Baek et al. 2021; Jumper et al. 2021; Lin et al. 2023) (the paradigms for protein structure prediction are illustrated in Fig. 2). Protein language models trained on a large number of protein sequences capture the interactions between residues very well, and have already demonstrated powerful capabilities in downstream tasks such as protein structure prediction and protein function prediction. In addition to protein understanding, protein language models have also demonstrated excellent generative capabilities, which are very important for protein design problems such as protein sequence generation. Below we introduce protein language models focused on protein understanding (protein sequence modeling) and on protein sequence generation.
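To make the idea of deriving residue interactions from an alignment more tangible, the sketch below computes the mutual information between two alignment columns. It is a bare-bones illustration on a made-up four-sequence alignment: practical pipelines typically add sequence reweighting, pseudocounts and background corrections, none of which is shown, and the function names are illustrative.

```python
from collections import Counter
from math import log


def column(msa, j):
    """Return the residues found at alignment position j."""
    return [seq[j] for seq in msa]


def mutual_information(msa, i, j):
    """Mutual information (in nats) between alignment columns i and j."""
    n = len(msa)
    col_i, col_j = column(msa, i), column(msa, j)
    count_i, count_j = Counter(col_i), Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        p_a, p_b = count_i[a] / n, count_j[b] / n
        mi += p_ab * log(p_ab / (p_a * p_b))
    return mi


# Toy alignment: columns 0 and 1 always change together, column 2 varies independently.
toy_msa = ["AKLE", "AKVE", "DRLE", "DRVE"]
mi_coupled = mutual_information(toy_msa, 0, 1)
mi_independent = mutual_information(toy_msa, 0, 2)
```

Columns that co-vary across homologous sequences receive a high score and are candidates for interacting residues, while independently varying columns score near zero.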

Protein sequence modeling based on protein language model
Sequence modeling has been a long-standing research problem in the domain of natural language processing, and advances in the NLP domain have shown that language models trained on huge amounts of unlabeled sequences, especially those based on the Transformer architecture, have a very good ability to model sequences.This success quickly extended to other research domains, and protein science was a pioneer in applying language models.Early protein language models for protein sequence modeling were mainly trained on protein sequence datasets in the form of

Fig. 1 Overview of molecular interactions and foundation models. A Types of molecular interactions at different spatial scales. B Processes for representing data as embeddings using foundation models and using the embeddings for downstream tasks, where x_i denotes the i-th element in the data X and h_i denotes the embedding corresponding to the i-th element. C A masked language model learns the correlation between elements in the data by masking a portion of the elements (denoted as M) and then using the remaining portion to predict the masked elements; the difference between the predicted value of the masked portion and the true value is used to update the model. D An autoregressive language model learns the correlation between elements in the data by sequentially predicting the next element from the beginning (denoted as S); the difference between the predicted and true value of the next element is used to update the model.

and Hoehndorf 2020) applied protein language models to protein function prediction, and all achieved favorable results.
Compared with single protein sequences, homologous sequences in multiple sequence alignments contain rich evolutionary information that can greatly assist the inference of residue interactions; therefore, protein language models based on multiple sequence alignments may capture residue interactions better than those based on single protein sequences. MSA-Transformer (Rao et al. 2021) is, to the best of our knowledge, the first protein language model trained on multiple sequence alignments. It is built primarily from axial-attention modules and is also trained with the objective of recovering the content of masked regions. The analysis results show that MSA-Transformer significantly outperforms ESM-1b in capturing residue interactions and achieves the best performance on the task of protein residue contact prediction. A-Prot (Hong et al. 2022) performs residue contact prediction using MSA-Transformer and inputs the predicted pairs of contacting residues into PyRosetta for protein structure prediction. The analysis results show that the quality of structures predicted by A-Prot exceeded that of the best structure prediction methods available at the time, but it was still far from solving the problem of protein structure prediction. The emergence of AlphaFold2 (Jumper et al. 2021) has virtually solved the problem of structure prediction for proteins, and results at CASP14 show that, for most proteins, the quality of the structure predicted by AlphaFold2 is comparable to that of the experimentally resolved structure. AlphaFold2 is a protein language model with an encoder-decoder architecture, in which the encoder consists of a stack of 48 Evoformer modules that extract the representation of multiple sequence alignments and explicitly predict the spatial distance between residues. The decoder, or structure module, consists of eight stacked layers and generates the protein structure from the MSA representation. Specifically, the decoder initializes the spatial position of each residue in the protein at the origin, and each subsequent layer updates the protein structure with the sequence representations and residue distances from the encoder.
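The axial attention mentioned above for alignment-based models can be sketched as attention applied along one axis of the alignment at a time: first along residues within each sequence, then along sequences at each residue position. The NumPy example below is a simplified, single-head illustration under that assumption; it is not the MSA-Transformer or Evoformer implementation, and it omits refinements such as tied row attention, pair representations and normalization layers.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, W_q, W_k, W_v):
    """Single-head attention along the second-to-last axis of x, shape (..., n, d)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V


def axial_attention(msa_repr, row_weights, col_weights):
    """msa_repr has shape (n_seqs, n_residues, d)."""
    # Row attention: every sequence attends over its own residue positions.
    rows = msa_repr + self_attention(msa_repr, *row_weights)
    # Column attention: swap axes so every position attends across sequences.
    cols = rows.swapaxes(0, 1)
    cols = cols + self_attention(cols, *col_weights)
    return cols.swapaxes(0, 1)


rng = np.random.default_rng(0)
n_seqs, n_res, d = 3, 6, 4  # toy sizes chosen for illustration
msa_repr = rng.normal(size=(n_seqs, n_res, d))
row_weights = [rng.normal(size=(d, d)) for _ in range(3)]
col_weights = [rng.normal(size=(d, d)) for _ in range(3)]
out = axial_attention(msa_repr, row_weights, col_weights)
```

Splitting attention by axis keeps the cost far below that of full attention over every sequence-position pair, which is one reason this design suits large alignments.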

Protein sequence generation based on protein language model
Generating protein sequences from scratch and generating constraint-compliant protein sequences are the two main application scenarios for protein sequence generation. Although the UniRef100 protein sequence database (Mirdita et al. 2017) already contains about 250 million protein sequences, these sequences account for only a very small portion of the protein sequence space; if foldable protein sequences can be generated computationally and rapidly, more options become available to fields that use proteins, such as catalysis or pharmaceuticals. ProtGPT2 (Ferruz et al. 2022) is a protein language model trained on 45 million protein sequences with the training goal of predicting the next word based on the current sentence, which makes it naturally suitable for generating protein sequences from scratch. Analysis of the protein sequences generated by ProtGPT2 showed that the proportion of disordered structures and the amino acid frequencies are almost the same as those of natural sequences, indicating that ProtGPT2 can generate protein sequences similar to natural ones. RITA (Hesslow et al. 2022) explored the effect of model scale on generative ability by training a series of protein language models of different scales with the next-word prediction objective, and the results showed that the larger the language model, the higher the reliability of the generated protein sequences. In addition, Verkuil et al. (2022) explored the use of a masked protein language model to generate protein sequences and experimentally verified that the generated sequences have a high probability (67%) of being soluble. ProGen (Madani et al. 2023) is a representative work on generating protein sequences under given constraints, and is also a protein language model trained to predict the next word. Unlike other protein language models, ProGen allows the function of the protein to be specified and then generates protein sequences that match that function; experiments show that the sequences generated by ProGen can realize some functions better than natural sequences while having low similarity to existing natural protein sequences.
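Generating a sequence "from scratch" with a next-word objective amounts to repeatedly sampling the next residue from the model's predicted distribution. The sketch below shows that sampling loop with a deliberately simple stand-in: a bigram (first-order Markov) model fitted on three made-up sequences replaces the Transformer, and the boundary symbols and function names are hypothetical. Only the generation procedure, not the model, is meant to resemble the approaches above.

```python
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # hypothetical boundary symbols


def fit_bigram_model(sequences):
    """Count how often each residue follows each other residue."""
    counts = defaultdict(Counter)
    for seq in sequences:
        chain = [START, *seq, END]
        for prev, nxt in zip(chain, chain[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, max_len=50, seed=0):
    """Sample residues one at a time until END is drawn or max_len is reached."""
    rng = random.Random(seed)
    prev, out = START, []
    while len(out) < max_len:
        choices, weights = zip(*counts[prev].items())
        nxt = rng.choices(choices, weights=weights, k=1)[0]
        if nxt == END:
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)


toy_training_set = ["MKTAYIAK", "MKVLAAGI", "MSTAYLAK"]  # made-up sequences
model = fit_bigram_model(toy_training_set)
sample = generate(model, seed=1)
```

A large language model replaces the bigram table with a distribution conditioned on the whole prefix, and conditional generation additionally prepends control information (for example a function tag) to that prefix.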

LANGUAGE MODELS FOR GENOMICS
Like proteins, DNA and RNA are important biomacromolecules in organisms. DNA mainly serves to encode genetic information, and interpreting DNA with the help of language modeling is a field of research that has emerged in the last two years. For RNA, only about 5% of all RNA transcripts are mRNAs coding for proteins; the remaining portion, called non-coding RNAs, performs functions such as signaling and gene regulation (Wang and Chang 2011). Non-coding RNAs can perform specific functions only if they maintain specific structures, but the severe scarcity of RNA structural data has limited the performance of RNA structure prediction methods. In contrast to structural data, RNA sequence data have accumulated with the development of RNA sequencing technology. Because the structure of RNA is determined by the interactions between nucleotides, how to distill these interactions from the huge amount of RNA sequence data has become an important issue, and this is an area in which language modeling specializes.
The development of DNA language models and RNA language models is a rising research area; the development of DNA/RNA language models and their applications is described below (see Fig. 3).

DNA sequence modelling based on the DNA language model
DNABERT (Ji et al. 2021) is, to the best of our knowledge, the first DNA language model using the BERT architecture. Specifically, DNABERT uses the human genome as training data and the k-mer representation of DNA as words; taking the DNA sequence "ATGGCT" as an example, the 3-mer representation used by DNABERT represents the sequence as {ATG, TGG, GGC, GCT}. The excellent performance of DNABERT in predicting proximal and core promoter regions and identifying transcription factor binding sites demonstrates the potential of language models in DNA research. In contrast to DNABERT, which was trained using only the human genome, Nucleotide Transformer (Dalla-Torre et al. 2023) was trained using the genomes of 850 species and showed excellent performance in detecting genetic variants and predicting the effects of mutations. DNABERT-2 (Zhou et al. 2023b) is an upgraded version of DNABERT, which not only proposes a simple and effective scheme for DNA tokenization, but also dramatically improves training efficiency by adopting techniques such as Flash Attention. In addition, DeepCRISPR (Chuai et al. 2018) is a representative work that uses a DNA foundation model for CRISPR sgRNA design.
nucleotide distance prediction, secondary structure prediction, etc. show that the prediction performance using RNA-FM is better than that using only RNA sequences, suggesting that RNA-FM captures partial nucleotide interactions. Uni-RNA (Wang et
In addition to models trained on single sequences, some RNA language models have been developed based on MSAs of RNA. RNA-MSM (Zhang et al. 2023) adopts the MSA-Transformer architecture and uses 3932 MSAs for training, and it outperforms traditional algorithms in solvent accessibility prediction as well as secondary structure prediction tasks, which demonstrates the application value of RNA language models. In addition, works such as trRosettaRNA (Wang et al. 2023d), DRfold (Li et al. 2023), and RoseTTAFoldNA (Baek et al. 2024) used an architecture similar to the encoder of AlphaFold2 to process MSAs for RNA structure prediction, and also achieved promising results.
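The overlapping k-mer representation described above for DNABERT, in which "ATGGCT" becomes {ATG, TGG, GGC, GCT}, can be reproduced with a few lines of Python. The function name is illustrative, and mapping the resulting words to vocabulary indices is left out.

```python
def kmer_tokenize(sequence, k=3):
    """Split a DNA sequence into overlapping k-mers with a stride of one base."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


tokens = kmer_tokenize("ATGGCT", k=3)  # ['ATG', 'TGG', 'GGC', 'GCT']
```

A sequence of length n yields n - k + 1 overlapping words, so neighbouring tokens share k - 1 bases; this redundancy is one motivation for alternative tokenization schemes.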

LANGUAGE MODELS FOR SINGLE CELL TRANSCRIPTOMES
Cells are the basic units of life. The complex regulatory relationships between intracellular genes determine the behavior and function of cells, and the complex interactions between various types of cells in an organism give rise to more advanced life activities. Deciphering the intracellular regulatory network between genes and the communication network between cells is crucial for analyzing the differences between cell types and understanding life processes, and the development of single-cell transcriptome sequencing technology has dramatically advanced this effort (Kolodziejczyk et al. 2015; Jovic et al. 2022). The transcriptome is the total of the transcription products of all genes in a cell under specific spatial and temporal conditions; it determines the specificity of the cell and is also the result of complex intra- and intercellular regulatory relationships. Single-cell transcriptome sequencing technology has accumulated a large amount of single-cell transcriptome data in the past decade (Cao et al. 2017; Moreno et al. 2022), and numerous algorithms have tried to decipher intracellular gene regulation and intercellular communication with the help of these data (Bafna et al. 2023; Dai et al. 2019; Iacono et al. 2019; Wang et al. 2023c). Recently, transcriptome language models have made great progress in capturing gene regulatory relationships (Cui et al. 2023; Theodoris et al. 2023; Wen et al. 2023; Yang et al. 2022), and have gradually become the main method for analyzing single-cell transcriptome data (see Fig. 4). In addition, transcriptome language models have shown very good performance in cell type identification, gene expression prediction and other tasks. In the following, we introduce the training approaches and applications of transcriptome language models.
The transcriptome of a single cell contains both gene types and the corresponding gene expression values. An ideal transcriptome language model should be able to capture the causal relationships between all elements (gene types, gene expressions) in the transcriptome, and this ability is closely related to the design of the training objective. Earlier transcriptome language models were mainly trained to recover the content of masked regions, but they differ in how the masking is done. scBERT (Yang et al. 2022) uses the Performer module to build the model, which can handle longer sequences than the standard Transformer. scBERT was trained on the Panglao human single-cell transcriptome dataset (containing about one million transcriptomes) by masking the expression of a portion of the genes with non-zero expression in the transcriptome and then predicting the masked expression values. scBERT achieves the best performance on the tasks of cell type annotation and identification of novel cells, which indicates that the model captures cell specificity. Compared to scBERT, which only aims at recovering the gene expression in the masked region, the training objective of scFormer (Cui et al. 2022) includes recovering both the gene expression and the gene type in the masked region, and it also achieves good performance on tasks such as gene perturbation and batch effect correction. Gene expression can fluctuate widely and can also contain systematic noise due to batch effects. Geneformer (Theodoris et al. 2023) therefore uses a new type of training objective: it first sorts the genes in the transcriptome according to their expression and then, after masking genes at random, predicts the types of the genes at the masked positions. This objective uses the information in both genes and expression values while avoiding the noise in the absolute expression values. The analysis results show that Geneformer handles batch effects well and performs well on tasks such as network dynamics prediction and gene perturbation prediction, suggesting that Geneformer learns the regulatory relationships between genes from the transcriptome well. scFoundation (Hao et al. 2023) starts from the observation that the vast majority of genes in a single-cell transcriptome are not expressed (expression is zero), so processing all genes and expression values would greatly limit the inference speed of the model as well as the scale of the trainable model. It therefore uses an asymmetric encoder-decoder language model architecture, in which the encoder module only processes genes with non-zero expression. This architecture allows scFoundation to reach a scale of 100 million parameters and to outperform pretrained models such as scBERT and Geneformer.
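The rank-based training objective described above for Geneformer can be sketched in two steps: order the expressed genes by expression, then mask some positions so that the model must predict which gene belongs there. The snippet below is an illustration on made-up values; any normalization applied before ranking is omitted, and the mask symbol, mask rate, tie-breaking rule and function names are assumptions.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol


def rank_encode(expression):
    """Order genes with non-zero expression from highest to lowest expression.

    Ties are broken alphabetically so that the output is deterministic.
    """
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    expressed.sort(key=lambda gv: (-gv[1], gv[0]))
    return [gene for gene, _ in expressed]


def mask_ranked_genes(ranked_genes, mask_rate=0.15, seed=0):
    """Hide some positions; the labels are the gene identities at those positions."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(ranked_genes)))
    positions = rng.sample(range(len(ranked_genes)), n_mask)
    masked = list(ranked_genes)
    labels = {}
    for pos in positions:
        labels[pos] = ranked_genes[pos]
        masked[pos] = MASK
    return masked, labels


cell = {"ACTB": 20.0, "GATA1": 5.0, "MYC": 5.0, "TP53": 0.0}  # made-up values
ranked = rank_encode(cell)
masked, labels = mask_ranked_genes(ranked)
```

Because only the ordering of genes enters the model, a uniform rescaling of the expression values, such as a multiplicative batch effect, leaves the input unchanged.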
In addition to transcriptome language models that are trained with the objective of recovering the content of masked regions, work such as scGPT (Cui et al. 2023) as well as scTranslator (Liu et al. 2023) have explored the application of generative language models in the transcriptome.scGPT is trained to sequentially predict the expression of genes with unknown expression based on the known gene expression and cell type, and thus the model has the ability to generate the transcriptome of an entire cell while only the cell type is specified.scTranslator, on the other hand, is a generative transcriptome language model trained to infer protein abundance values.scTranslator can predict the proteome of a single cell given that cell's transcriptome, and analysis has shown that the interactions between proteins (genes) inferred by scTranslator are relatively accurate.

GRAPH NEURAL NETWORKS ON SPATIAL TRANSCRIPTOMICS
Recent advances in spatially resolved transcriptomics (ST) technologies have enabled close-up investigation of in situ gene expression and the spatial location of cells in tissues. Spatial transcriptomics data profile cell type structure, gene expression with spatial patterns, and cell-to-cell interactions in their spatial context. This knowledge is essential for understanding and explaining complex life systems, e.g., disease progression (Ye et al. 2022; Chen et al. 2020), tumor micro-environment (Zhu et al. 2022; Ferri-Borgogno et al. 2023)
Although ST provides revolutionary data on tissues, its analysis is challenged by intrinsic noise, high sparsity, and multimodality (gene expression matrices, spatial coordinates and histology images). The main tasks in analyzing ST datasets include the detection of spatial domains and spatially variable genes (SVGs), cell type decomposition and data augmentation. In addition, three-dimensional (3D) cellular structure construction is required to better understand biological processes in the whole organ and organism. To meet these needs, many computational methods have been developed. Graph neural networks (GNNs) have attracted much attention in recent articles (Wu et al. 2019; Liu et al. 2024). Unlike other common methods, which fail to use the spatial coordinates and histology image information, GNNs can learn jointly from gene expression data, spot spatial coordinates (in the form of a neighborhood graph), and histology images. GNNs are generally self-supervised or semi-supervised models. As shown in Fig. 5, the GNNs used in ST methods can be divided into four general categories: graph convolutional networks (GCN), graph attention networks (GAT), graph generative networks and graph autoencoders. Compared with other models, these GNNs can learn and preserve the relational information in spatial location and image data, which lets them perform better in many tasks such as spatial domain detection, cell type decomposition and 3D tissue construction.
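To show how a graph neural network can combine gene expression with spot coordinates, the sketch below builds a nearest-neighbour graph from spatial positions and applies one graph convolution with the widely used symmetric normalization and self-loops. The coordinates, expression values, weights and the choice of one neighbour per spot are toy assumptions, and the example is not the implementation of any specific ST method.

```python
import numpy as np


def knn_adjacency(coords, k):
    """Symmetric 0/1 adjacency linking each spot to its k nearest spots."""
    n = len(coords)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a spot is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    return np.maximum(A, A.T)  # make the graph undirected


def gcn_layer(A, H, W):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(len(A))  # add self-loops
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(deg ** -0.5)
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)


rng = np.random.default_rng(0)
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.5, 0.0], [10.0, 0.0]])  # toy spot positions
expression = np.array([[5.0, 0.0, 1.0],
                       [4.0, 0.0, 2.0],
                       [0.0, 3.0, 1.0],
                       [0.0, 6.0, 0.0]])  # toy spots x genes matrix
W = rng.normal(size=(3, 2))  # placeholder for learned weights

A = knn_adjacency(coords, k=1)
embeddings = gcn_layer(A, expression, W)
```

Each spot's embedding mixes its own expression profile with those of its spatial neighbours, which is the mechanism by which spatial context enters tasks such as domain detection or denoising.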
As mentioned before, due to the low capture efficiency and high technology noise in ST data, data augmentation (imputation, denoise) is a key task in ST data analysis.For this task, one kind of method is to integrate scRNA-seq data with ST, such as stPlus (Chen et al. 2021b) and SpaGE (Abdelaal et al. 2020).However, doing so might induce new bias and unwanted noise due to the unpaired samples and technology differences.Another kind of method mainly considers the ST data itself and usually makes the augmentation with the neighborhood structure of ST spots, which is associated with spatial location.In this situation, GNN-based methods can be appealing, i.e., SEDR (Fu et al. 2021), stMVC (Zuo et al. 2022) and SiGra (Tang et al. 2023).SEDR is an unsupervised model that integrates transcriptomics data and associated spatial information.It first constructs a lowdimension latent representation of the ST matrix through a deep autoencoder, and then combines it with the corresponding spatial loci information by a variational graph autoencoder.The SEDR pipeline performed well on human dorsolateral prefrontal cortex data, and was able for batch correction.stMVC is a muti-modal model method that integrates gene expression matrix, spatial location, histology image and region segmentation.It applied a semi-supervised graph attention autoencoder to capture the structure of ST data, and the whole model can elucidate intratumoral heterogeneity in ST data.SiGra was designed to denoise gene expression data in ST.A graph transformer was used to leverage the rich information in the spatial distribution of spots and cells, and the inclusion of immunohistochemistry images by imagingtranscriptomics hybrid architecture can help improve the performance by 37%.
Deciphering spatial domains and SVGs is critical for understanding the biological structure and function of tissue. In this task, models must consider both the spatial location of cells and gene expression. SpaGCN (Hu et al. 2021) applied a graph convolutional network (GCN)-based approach to detect spatial domains and SVGs. Spatial domain detection is based on a weighted graph built from gene expression, histology image and spatial location, and SVGs are then calculated on the detected spatial domains. STAGATE (Dong and Zhang 2022) developed a graph attention autoencoder framework to identify spatial domains. The graph attention autoencoder learns to integrate gene expression and spatial location information, and adopts a graph attention mechanism when considering spatial neighbor information. STAGATE performed well in the accuracy of spatial domain and SVG detection. CCST (Li et al. 2022a) is an unsupervised cell clustering method based on GCN. The cell clusters provided by CCST can help identify accurate cell types and then spatial domains. Spatial-MGCN (Wang et al. 2023a) adopted a multi-view GCN encoder to extract unique embeddings from gene expression and spatial location graphs. Incorporating this information helps Spatial-MGCN outperform other methods in spatial domain detection.
The resolution of the majority of ST technologies has not reached the single-cell level, so cell type decomposition of ST data is commonly needed. Many methods designed for ST cell type decomposition use scRNA-seq as a reference, e.g., cell2location (Kleshchevnikov et al. 2022), SPOTlight (Elosua-Bayes et al. 2021) and Tangram (Biancalani et al. 2021). Spatially nearby spots are more likely to share similar cell-type compositions, so leveraging spatial location through GNNs could improve cell-type decomposition performance. DSTG (Song and Su 2021) adopts GCN to learn the latent representation of both gene expression and spatial locations of spots, and then applies decomposition on the latent representation matrix. GraphST (Long et al. 2023) is a graph self-supervised contrastive learning method. In GraphST, a GNN accompanied by augmentation-based self-supervised contrastive learning is used to learn representations of spots.
3D construction of whole tissues or organs can accelerate the understanding of disease processes and organogenesis. Since one individual ST slice contains gene expression information on a 2D plane, the 3D construction of tissue requires the integration of multiple slices. There are several methods for integrating parallel ST slices and 3D construction, e.g., PASTE (Zeira et al. 2022), STAligner (Zhou et al. 2023a) and STitch3D (Wang et al. 2023b). PASTE mainly aligns spots in different slices based on their gene expression similarity and spatial distances, using an optimal transport algorithm. STAligner develops a graph attention autoencoder to learn spot embeddings with gene expression and spatial location information. The subsequent alignment is based on the embeddings and the spatial domains shared between slices. STitch3D is a joint model for 3D domain detection and cell-type decomposition of ST. A graph attention network is utilized to learn the representation of spots' gene expression and the 3D spatial adjacency network.
In summary, ST contains multi-modal data, i.e., gene expression, spatial locations and histology images, and analysis methods need to make full use of this information. GNNs are efficient at capturing relative information from network-style data. When dealing with noisy and sparse ST data, GNNs have great potential in solving tasks including data augmentation, spatial domain and SVG detection, cell type decomposition and 3D construction of tissue.

DISCUSSION
Foundational models in molecular biology are shaping new research approaches in the field. In this review, we provide a comprehensive summary of foundational models in molecular biology, detailing their architecture, training approaches, scope of application, and how they are used. Note that although significant achievements have been made by foundational models in molecular biology, most current language models are based on specific types of biological data, and cross-modal foundational models of greater value are still relatively rare. Another important issue regarding the foundation model is its relationship with "small sample learning", i.e., few-shot learning using relatively small training samples (Long et al. 2023). It should be noted that the "fine-tuning" strategy used with foundation models is actually intended to address the small sample issue in specific downstream tasks. However, a recent study indicated that the foundation model may fail in the zero-shot scenario, which is an extreme case of few-shot learning (Zeira et al. 2022) in which no training data are available for the specific task. For such low-data-resource learning cases, various few-shot learning schemes, for example meta learning, have been proposed (Zhou et al. 2023a). Several applications using meta learning to address molecule analysis problems, for example pMHC-TCR interaction recognition (Wang et al. 2023b) and kinome-wide polypharmacology profiling, have been presented (Benegas et al. 2023).
Finally, life processes are often dynamic, and multimodal foundational models that can take into account the spatio-temporal specificity of biological data may be able to make the digital cell a reality.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Fig. 2 Frameworks for protein structure prediction. A Traditional paradigm for protein structure prediction. B MSA-based end-to-end protein structure prediction. C Protein structure prediction from a single sequence based on protein language model

Fig. 3 Applications of foundation models in RNA science. Both sequence-based and MSA-based RNA foundation models can be applied to downstream tasks such as RNA secondary structure prediction, RNA structure prediction, etc.

Fig. 4 Applications of foundation models in single cell transcriptomes. BERT-style and GPT-style foundational models take different forms of single cell transcriptome data as input and can be applied to downstream tasks such as cell type annotation and chromatin dynamics prediction, where G denotes gene and E denotes expression

Fig. 5 Overview of graph neural networks on spatial transcriptomics